Language and Visual Entity Relationship Graph for Agent Navigation

Neural Information Processing Systems

Vision-and-Language Navigation (VLN) requires an agent to navigate a real-world environment by following natural language instructions. From both the textual and visual perspectives, we find that the relationships among the scene, its objects, and directional cues are essential for the agent to interpret complex instructions and correctly perceive the environment. To capture and utilize these relationships, we propose a novel Language and Visual Entity Relationship Graph that models the inter-modal relationships between text and vision and the intra-modal relationships among visual entities. We propose a message passing algorithm for propagating information between language elements and visual entities in the graph, which we then combine to determine the next action to take. Experiments show that by taking advantage of these relationships we are able to improve over the state of the art. On the Room-to-Room (R2R) benchmark, our method achieves the new best performance on the test unseen split, with a success weighted by path length (SPL) of 52%. On the Room-for-Room (R4R) dataset, our method significantly improves the previous best from 13% to 34% on success weighted by normalized dynamic time warping.
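The message passing the abstract describes can be illustrated with a minimal sketch: an inter-modal step where visual entities attend over language tokens, followed by an intra-modal step where visual entities exchange messages among themselves. All shapes, names, and the plain dot-product attention below are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical setup: L language tokens and V visual entities, each a d-dim vector.
rng = np.random.default_rng(0)
L, V, d = 5, 3, 8
lang = rng.standard_normal((L, d))   # language element features
vis = rng.standard_normal((V, d))    # visual entity features

# Inter-modal message passing: each visual entity attends over language tokens.
attn = softmax(vis @ lang.T)         # (V, L) attention weights, rows sum to 1
vis_msg = attn @ lang                # language message aggregated per entity

# Intra-modal message passing: visual entities exchange messages with each other.
self_attn = softmax(vis @ vis.T)     # (V, V)
vis_updated = vis + vis_msg + self_attn @ vis
```

In the actual model the attention and update functions are learned, and the graph structure constrains which entities (scene, objects, directions) exchange messages; this sketch only shows the propagation pattern.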


Appendix: Language and Visual Entity Relationship Graph for Agent Navigation

Neural Information Processing Systems

Replicating the encoding 32 times does not enrich its information, but it does make its gradient 32 times larger during back-propagation. We suspect that this helps the agent learn about the action-related terms (e.g. "turn left", "go forward") in the
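The gradient-scaling effect the appendix attributes to replication can be checked analytically: if a loss sums the same dot product over k copies of an encoding, the gradient with respect to that encoding is k times the single-copy gradient. The variables below are illustrative, not from the paper.

```python
import numpy as np

# Hypothetical encoding x and weight vector w (names are illustrative).
rng = np.random.default_rng(1)
k = 32
w = rng.standard_normal(4)
x = rng.standard_normal(4)

# Single copy: loss = w . x, so d(loss)/dx = w.
grad_single = w

# Replicated k times: loss = sum over k copies of w . x,
# so d(loss)/dx = k * w -- the gradient is scaled by k, the
# information content of x is unchanged.
grad_replicated = k * w
```

This is why replication can act as an implicit learning-rate boost for the replicated features rather than adding any new signal.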


Review for NeurIPS paper: Language and Visual Entity Relationship Graph for Agent Navigation

Neural Information Processing Systems

Weaknesses: - The proposed method is tailored for VLN, which may limit its generalization to other domains (it is not new for other vision-and-language tasks). If the same h_t and u are fed into the three attentions, how could different contexts be learned? There seems to be something wrong, either in the technique or in the notation. Moreover, VLN models may be sensitive to hyper-parameter tuning; it would be better if the authors could report the mean and standard deviation over multiple runs. In what cases would the proposed model fail?
